[QNN-EP] Implement file mapped weights feature #26952
edgchen1 merged 19 commits into microsoft:main
Conversation
The observation is not entirely accurate. ORT memory-maps external weights. You can request a weight as an ORT Value from the EP; if the weight is external, it will be memory mapped. See
I suggest not implementing QNN-specific mapping, but reusing existing code in ORT.
Discussed offline: the EP maps initializers from the binary context, not from the external weight files.
- Create file mapping callback interface class
- Android expected to have support in the future
- Implement Windows callbacks in WindowsFileMapper
- New option: disable_file_mapped_weights
- Feature is enabled by default, with retry logic
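As a rough illustration of the new opt-out option, the sketch below builds a QNN EP provider entry for onnxruntime's `InferenceSession(..., providers=[...])` API. The exact option key and value format are assumptions based on the `disable_file_mapped_weights` name in this PR, not confirmed documentation; `QnnHtp.dll` is the typical HTP backend library name on Windows.

```python
# Hypothetical sketch: option key and value format are assumptions taken
# from the PR's "disable_file_mapped_weights" option name.
def make_qnn_provider_entry(disable_file_mapping: bool):
    """Build a (provider_name, options) pair for InferenceSession(..., providers=[...])."""
    options = {"backend_path": "QnnHtp.dll"}  # typical QNN HTP backend on Windows
    if disable_file_mapping:
        # Opt out of file-mapped weights (the feature is on by default per this PR).
        options["disable_file_mapped_weights"] = "1"
    return ("QNNExecutionProvider", options)

entry = make_qnn_provider_entry(disable_file_mapping=True)
print(entry)
```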
Force-pushed from f55dc78 to 2e451ae.
Please avoid force pushes.
/azp run Linux QNN CI Pipeline,Windows ARM64 QNN CI Pipeline
Azure Pipelines successfully started running 2 pipeline(s).
Please comment on all Copilot review issues before resolving them.
QnnBackendManager::SetupBackend if file mapping is not available
Make file mapping callbacks more thread-safe; do not destruct file_mapper_ until session destruction.
and unnecessary functions relating to file_mapped_weights_enabled_
/azp run Linux QNN CI Pipeline,Windows ARM64 QNN CI Pipeline,Win_TRT_Minimal_CUDA_Test_CI,Windows GPU Doc Gen CI Pipeline
Azure Pipelines successfully started running 4 pipeline(s).
Description
Enables file mapping of the weights as well as the overall context bin. This feature is currently enabled only on Windows ARM64 devices.
Motivation and Context
Currently, when reading the context bin, ORT allocates a large buffer on the heap, and each ORT session using the same model allocates its own copy. This is very wasteful for large models. Instead, Windows file mapping can be leveraged to map the context bin once; whenever a context needs to be created from it, a pointer into the mapping is used in place of a pre-allocated buffer, making QNN EP more memory-efficient. With multiple ORT sessions, the context bin is loaded only once and shared by all sessions, improving both memory efficiency and overall initialization performance. This is especially valuable for LLM workloads going forward.
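The sharing behavior described above can be sketched in a few lines. The real feature uses the Win32 `CreateFileMapping`/`MapViewOfFile` APIs from C++; the Python `mmap` module is used here purely to illustrate the concept that multiple read-only views of the same file share OS pages instead of duplicating a heap buffer per session.

```python
import mmap
import os
import tempfile

def write_fake_context_bin(path: str) -> None:
    # Placeholder stand-in for a QNN context binary.
    with open(path, "wb") as f:
        f.write(b"QNN-CONTEXT-BIN" + bytes(1024))

def open_mapped_view(path: str) -> mmap.mmap:
    # ACCESS_READ gives a read-only, OS-shared view; mmap duplicates the
    # file descriptor, so the file object can be closed immediately.
    with open(path, "rb") as f:
        return mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)

path = os.path.join(tempfile.mkdtemp(), "ctx.bin")
write_fake_context_bin(path)

# Two "sessions" map the same context bin: the pages are shared by the OS
# rather than each session reading the file into its own heap buffer.
view_a = open_mapped_view(path)
view_b = open_mapped_view(path)
print(bytes(view_a[:15]))
```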